Big Data Begets Big Database Theory
Author
Abstract
Industry analysts describe Big Data in terms of three Vs: volume, velocity, variety. The data is too big to process with current tools; it arrives too fast for optimal storage and indexing; and it is too heterogeneous to fit into a rigid schema. There is huge pressure on database researchers to study, explain, and solve the technical challenges in big data, but we find no inspiration in the three Vs. Volume is surely nothing new for us; streaming databases have been studied extensively for over a decade, and research on data integration and semistructured data has examined heterogeneity from all possible angles. So what makes Big Data special and exciting to a database researcher, other than the great publicity that our field suddenly gets? This talk argues that the novelty should be sought along different dimensions, namely communication, iteration, and failure. Traditionally, database systems have assumed that the main complexity in query processing is the number of disk IOs, but today that assumption no longer holds. Most big data analyses simply use a large number of servers to ensure that the data fits in main memory: the new complexity metric is the amount of communication between the processing nodes, which is quite novel to database researchers. Iteration is not that new to us either, but SQL adopted iteration only lately, and only as an afterthought, despite the impressive research done on datalog in the 80s [1]. Big Data analytics often require iteration, so it will be a centerpiece of Big Data management, with new challenges arising from the interaction between iteration and communication [2]. Finally, parallel databases simply ignored node failure as a very rare event, handled by restarting the query. But failure is a common event in Big Data management, when the number of servers runs into the hundreds and one query may take hours [3]. The Myria project [4] at the University of Washington addresses all three dimensions of the Big Data challenge. Our premise is that each dimension requires a study of its fundamental principles, to inform the engineering solutions. In this talk I will discuss the communication cost in big data processing, which turns out to lead to a rich collection of beautiful theoretical questions; iteration and failure are left for future research.
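To make the communication metric concrete, the following is a minimal sketch (my own illustration, not code from the talk or from Myria) of a one-round hash-partitioned equi-join in the massively parallel setting the abstract describes: every input tuple is shipped to exactly one of P servers, and the quantity of interest is the maximum number of tuples any single server receives, not the number of disk IOs. The names P, partition, and join_mpc are hypothetical.

```python
# Illustrative sketch: one-round distributed hash join, with communication
# load (max tuples received by any server) as the cost metric.
from collections import defaultdict

P = 4  # assumed number of servers

def partition(relation, key_index):
    """Hash-partition tuples on the join key; each tuple is shipped once."""
    buckets = defaultdict(list)
    for t in relation:
        buckets[hash(t[key_index]) % P].append(t)
    return buckets

def join_mpc(R, S):
    """Join R(a, b) with S(b, c) in one communication round:
    ship R on attribute b, ship S on attribute b, join locally."""
    r_parts = partition(R, 1)
    s_parts = partition(S, 0)
    # Communication cost: the most loaded server, not total disk IOs.
    load = max(len(r_parts[p]) + len(s_parts[p]) for p in range(P))
    out = []
    for p in range(P):
        index = defaultdict(list)
        for (a, b) in r_parts[p]:
            index[b].append(a)
        for (b, c) in s_parts[p]:
            out.extend((a, b, c) for a in index[b])
    return out, load

R = [(i, i % 10) for i in range(1000)]
S = [(i % 10, i) for i in range(1000)]
result, load = join_mpc(R, S)
print(len(result), "output tuples; max per-server load:", load)
```

With N input tuples the ideal per-server load is N/P, but skew on the join key drives the load higher, and multiway joins require smarter partitioning (e.g., HyperCube-style shares): this gap between the ideal and the achievable is where the theoretical questions mentioned above begin.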
Similar References
The b²/c³ Problem: How Big Buffers Overcome Covert Channel Cynicism in Trusted Database Systems
We present a mechanism for communication from low to high security classes that allows partial acknowledgments and flow control without introducing covert channels. By restricting our mechanism to the problem of maintaining mutual consistency in replicated architecture database systems, we overcome the negative general results in this problem area. A queueing theory model shows that big buffers...
Application and Exploration of Big Data Mining in Clinical Medicine.
OBJECTIVE To review theories and technologies of big data mining and their application in clinical medicine. DATA SOURCES Literature published in English or Chinese regarding theories and technologies of big data mining and the concrete applications of data mining technology in clinical medicine was obtained from PubMed and the Chinese Hospital Knowledge Database from 1975 to 2015. STUDY SELE...
Towards NoSQL Graph Data Warehouse for Big Social Data Analysis
Big Data generated from social networking sites is the crude oil of this century. Data warehousing and analysing social actions and interactions can help corporations to capture opinions, suggest friends, recommend products and services and make intelligent decisions that improve customer loyalty. However, traditional data warehouses built on relational databases are unable to handle this massi...
Fuzzy Rough Set Conditional Entropy Attribute Reduction Algorithm
Modern science is increasingly data-driven and collaborative in nature. Compared with ordinary data processing, big data mixed with a great deal of missing data must be processed rapidly. Rough set theory was developed to deal with such large data. QuickReduct is a popular algorithm for attribute reduction over big databases. But less effort has been put on fuzziness and vague...
Geospatial Analytics for Big Spatiotemporal Data: Algorithms, Applications, and Challenges
Explosive growth in spatial and spatiotemporal data and the emergence of social media and location-sensing technologies emphasize the need for developing new, computationally efficient geospatial analytics tailored for analyzing big data. In this white paper, we review major spatial data mining algorithms by closely looking at their computational and I/O requirements and allude to a few appl...